Problem Statement¶

This report serves as my final project for the CQF program.

The objective is to develop a model that predicts daily upward movements in the stock price of Adobe (ticker: ADBE), using Long Short-Term Memory (LSTM) networks. The choice of Adobe pays tribute to my early career as a graphic designer, during which Adobe tools accompanied me through one of the most creative periods of my life.

The target variable is a binary classification with labels $\{0, 1\}$, where $1$ indicates a positive price movement. The model will be trained, evaluated, and tested on a dataset spanning January 1, 2020 to July 1, 2025 (roughly five and a half years).

The report highlights the project workflow, covering decision flows, mathematical rationale, and analytical insights. For the full executable code, please refer to DL Tuo Li CODE/DL Tuo Li CODE.ipynb.

Content¶

1. Preparation¶

2. Feature Engineering¶

  • 2.1 Define the target
    • 2.1.1 Decide the threshold
    • 2.1.2 Check class imbalance
  • 2.2 Generate features
    • 2.2.1 Features from trading data
      • 2.2.1.1 Weekday transformation
      • 2.2.1.2 Technical features from trading data
    • 2.2.2 Features from other resources
    • 2.2.3 Features from macro environment
      • 2.2.3.1 Features from QQQ
      • 2.2.3.2 Features from macro economy

3. Exploratory Data Analysis (EDA)¶

  • 3.1 Structural evaluation
    • 3.1.1 Handle the missing values
    • 3.1.2 Create the feature X and the target y
    • 3.1.3 Explore feature distribution and analyze outliers
      • 3.1.3.1 Visualize the features
      • 3.1.3.2 Group the features
      • 3.1.3.3 Scale the features
  • 3.2 SHAP analysis and feature relationship exploration
    • 3.2.1 SHAP analysis
    • 3.2.2 Explore relationships among high-impact features
  • 3.3 Analyze multi-collinearity and reduce dimensionality
    • 3.3.1 VIF analysis and correlation heatmap
    • 3.3.2 Reduce dimensionality
      • 3.3.2.1 Cluster-based selection
      • 3.3.2.2 Correlation-based selection

4. Model Building¶

  • 4.1 Prepare dataset
    • 4.1.1 Split and scale the dataset
    • 4.1.2 Create data generator
  • 4.2 Baseline model - 2 layer LSTM model without dropout
    • 4.2.1 Build the model (2 layers without dropout)
    • 4.2.2 Evaluate the model (2 layers without dropout)
      • 4.2.2.1 Evaluate the training data against the testing data
      • 4.2.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis
      • 4.2.2.3 Baseline model evaluation summary
  • 4.3 Variant model A - 2 layer LSTM model with dropout
    • 4.3.1 Build the model (2 layers with dropout)
    • 4.3.2 Evaluate the model (2 layers with dropout)
      • 4.3.2.1 Evaluate the training data against the testing data
      • 4.3.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis
      • 4.3.2.3 Model evaluation summary
  • 4.4 Variant model B - 3 layer LSTM model without dropout
    • 4.4.1 Build the model (3 layers without dropout)
    • 4.4.2 Evaluate the model (3 layers without dropout)
      • 4.4.2.1 Evaluate the training data against the testing data
      • 4.4.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis
      • 4.4.2.3 Model evaluation summary
  • 4.5 Variant model C - 3 layer LSTM model with dropout
    • 4.5.1 Build the model (3 layers with dropout)
    • 4.5.2 Evaluate the model (3 layers with dropout)
      • 4.5.2.1 Evaluate the training data against the testing data
      • 4.5.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis
      • 4.5.2.3 Model evaluation summary
  • 4.6 Review of all the models

5. Trading strategy with backtesting¶

  • 5.1 Profit analysis
  • 5.2 Sharpe ratio analysis
  • 5.3 Underwater curve analysis
  • 5.4 Pyfolio analysis

6. Conclusion¶

1. Preparation¶

In this project, all data is presented in tabular format.

The main dataset, referred to as df, contains Adobe trading data from January 1, 2020 to July 1, 2025, sourced directly from MacroTrends.

Key characteristics of the dataset:

  • Index: date (one row per trading day)

  • Total rows: 1381 (approx. 5.5 years of trading data)

  • Columns: 6

    • open, high, low, close : float values representing price data

    • volume : integer representing trading volume

    • weekday : string indicating the trading day (Mon–Fri)

  • Data Quality : No missing values

Dataset preview (first 5 rows):

date open high low close volume weekday
2020-01-02 330.000 334.480 329.170 334.430 1990496 Thu
2020-01-03 329.170 332.980 328.690 331.810 1579368 Fri
2020-01-06 328.290 333.910 328.190 333.710 1875122 Mon
2020-01-07 334.150 334.790 332.305 333.390 2507261 Tue
2020-01-08 333.810 339.230 333.400 337.870 2248531 Wed

Statistical summary of the numerical columns:

count mean std min 25% 50% 75% max
open 1381 466.742 93.730 277.800 384.970 470.480 526.035 696.275
high 1381 472.594 93.930 279.590 390.130 475.867 533.510 699.540
low 1381 460.621 93.216 255.131 380.945 462.480 519.560 678.910
close 1381 466.773 93.617 275.200 385.710 469.730 526.940 688.370
volume 1381 3149150.357 1864608.743 589182 2104030 2660097 3582532 27840211

Back to Content

2. Feature Engineering¶

In this chapter, I define the target of the project and construct candidate features using a range of techniques.

2.1 Define the target¶

2.1.1 Decide the threshold¶

Since this project focuses on predicting the direction of daily returns as a binary classification with labels $\{0, 1\}$, it is essential to establish a threshold for distinguishing between positive and negative returns.

Before determining a suitable classification threshold, we will explore the data. The close price is used to represent the daily price of Adobe. Let's observe how the close column evolved over the past five years.

[Figure: 1_close_price]

Over the past five years, Adobe's price has shown significant volatility with a slight upward drift.

To analyze this further, I will create a return column and examine the distribution of the daily return.

Main dataset with the return column (first 5 rows):

date open high low close volume weekday return
2020-01-02 330.000 334.480 329.170 334.430 1990496 Thu NaN
2020-01-03 329.170 332.980 328.690 331.810 1579368 Fri -0.008
2020-01-06 328.290 333.910 328.190 333.710 1875122 Mon 0.006
2020-01-07 334.150 334.790 332.305 333.390 2507261 Tue -0.001
2020-01-08 333.810 339.230 333.400 337.870 2248531 Wed 0.013
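The return column can be reproduced with pandas' pct_change; the snippet below is a minimal sketch using only the five preview rows (the full notebook operates on the complete df):

```python
import pandas as pd

# Toy slice of the trading data, indexed by date, matching the preview above
df = pd.DataFrame(
    {"close": [334.43, 331.81, 333.71, 333.39, 337.87]},
    index=pd.to_datetime(
        ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07", "2020-01-08"]
    ),
)

# Daily simple return: close_t / close_{t-1} - 1; the first row has no
# previous close, so it is NaN (matching the preview table)
df["return"] = df["close"].pct_change()

print(df["return"].round(3).tolist())
```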

The distribution histogram of the returns:

[Figure: 2_return_distribution]

As illustrated in the histogram, Adobe's daily returns exhibit a roughly normal distribution with a slight positive skew.

To optimize classification performance, I set the threshold at $0.2\%$ for the following reasons:

  • It approximately splits the dataset evenly, minimizing class imbalance and helping stabilize model training.

  • The primary objective is to generate meaningful predictions for positive market moves (class 1). A positive threshold ensures that class 1 predictions are significant if the model performs well, while also providing room to account for transaction costs in reality.

Under this target definition:

  • A value of $1$ is assigned when the next day's closing price is at least $0.2\%$ higher than the current day's close, indicating a potential buying opportunity. Otherwise, no action is taken.

  • Returns below $0.2\%$ are labeled as $0$.

Main dataset (first 5 rows) with the newly created target column which indicates whether the next day's return is greater than $0.2\%$ or not:

date open high low close volume weekday return target
2020-01-02 330.000 334.480 329.170 334.430 1990496 Thu NaN 0
2020-01-03 329.170 332.980 328.690 331.810 1579368 Fri -0.008 1
2020-01-06 328.290 333.910 328.190 333.710 1875122 Mon 0.006 0
2020-01-07 334.150 334.790 332.305 333.390 2507261 Tue -0.001 1
2020-01-08 333.810 339.230 333.400 337.870 2248531 Wed 0.013 1
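The labeling rule can be sketched as a shift of the return column. The toy frame below reuses the preview returns; note its final row defaults to $0$ only because this short window has no following day, whereas the real dataset continues:

```python
import pandas as pd

# Toy frame with the returns from the preview above
df = pd.DataFrame({"return": [None, -0.008, 0.006, -0.001, 0.013]})

# Label 1 when the NEXT day's return is at least 0.2%, else 0
threshold = 0.002
df["target"] = (df["return"].shift(-1) >= threshold).astype(int)

print(df["target"].tolist())
```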

2.1.2 Check class imbalance¶

With a threshold of $0.2\%$, in the target column, class 0 ($715$ samples) represents $51.77\%$ of the data, while class 1 ($666$ samples) represents $48.23\%$.

The data is nearly balanced, so we do not need to be concerned about target imbalance.
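The class shares quoted above can be verified with value_counts; the series below simply reconstructs the reported label counts:

```python
import pandas as pd

# Reconstruct the label counts reported above (715 zeros, 666 ones)
target = pd.Series([0] * 715 + [1] * 666, name="target")

# Percentage share of each class
share = (target.value_counts(normalize=True) * 100).round(2)
print(share.to_dict())
```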

Back to Content

2.2 Generate features¶

2.2.1 Features from trading data¶

2.2.1.1 Weekday transformation¶

Weekday patterns can influence stock price behavior, so this temporal feature is included in our model.

Instead of using traditional one-hot encoding, which expands weekdays into five separate columns, we apply a trigonometric transformation using sine and cosine functions. This method captures the cyclical nature of weekdays, particularly the continuity between Friday and Monday, while adding only two numerical columns. This improves efficiency and preserves temporal structure.

The formulas for the new features dsin and dcos are: $$\text{dsin} = \sin\!\left(\frac{2\pi \cdot \text{num}}{7}\right) \qquad \text{dcos} = \cos\!\left(\frac{2\pi \cdot \text{num}}{7}\right)$$ Where num is the numerical representation of the weekday:

  • $1$ : Monday

  • $2$ : Tuesday

  • $3$ : Wednesday

  • $4$ : Thursday

  • $5$ : Friday

This transformation ensures the model understands the cyclical flow of time without inflating the feature space.

Below is the updated dataset (first 5 rows) with the 2 new features. The original weekday column is removed.

date open high low close volume return target dsin dcos
2020-01-02 330.000 334.480 329.170 334.430 1990496 NaN 0 -0.434 -0.901
2020-01-03 329.170 332.980 328.690 331.810 1579368 -0.008 1 -0.975 -0.223
2020-01-06 328.290 333.910 328.190 333.710 1875122 0.006 0 0.782 0.623
2020-01-07 334.150 334.790 332.305 333.390 2507261 -0.001 1 0.975 -0.223
2020-01-08 333.810 339.230 333.400 337.870 2248531 0.013 1 0.434 -0.901
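The transformation can be sketched as follows: weekday strings are mapped to 1–5 and passed through the sine/cosine formulas above (values reproduce the preview table):

```python
import numpy as np
import pandas as pd

weekday_num = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5}

df = pd.DataFrame({"weekday": ["Thu", "Fri", "Mon", "Tue", "Wed"]})
num = df["weekday"].map(weekday_num)

# Cyclical encoding over a 7-day period, as in the formulas above
df["dsin"] = np.sin(2 * np.pi * num / 7)
df["dcos"] = np.cos(2 * np.pi * num / 7)

# The original string column is no longer needed
df = df.drop(columns="weekday")

print(df.round(3))
```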

Back to Content

2.2.1.2 Technical features from trading data¶

Then, I will leverage the add_all_ta_features function from the ta library to automatically generate a wide range of technical indicators from the OHLCV (open, high, low, close, volume) data. This step efficiently enriches the dataset with 80+ indicators across the following categories:

  • Volume: OBV, CMF, MFI, NVI, VWAP, etc.

  • Volatility: Bollinger Bands, Keltner Channels, Donchian Channels, etc.

  • Trend: SMA, EMA, MACD, ADX, Ichimoku, Parabolic SAR, Aroon, etc.

  • Momentum: RSI, Stochastic RSI, ROC, PPO, TSI, etc.

  • Others: daily returns, log returns, cumulative returns, etc.

After generation, further feature refinement steps are applied:

1. Remove redundant return feature
Since the daily return is already present in the dataset, the duplicated others_dr feature generated by add_all_ta_features is dropped to avoid redundancy.

2. Consolidate sparse PSAR trend signals
Among the trend indicators generated by add_all_ta_features, trend_psar_up and trend_psar_down exhibit a high number of missing values due to the mechanics of the Parabolic SAR (PSAR), which trails price movements using a dynamic stop level. The PSAR is computed as:

$$ PSAR_t = PSAR_{t-1} + AF \cdot (EP_{t-1} - PSAR_{t-1}) $$ Where:

  • $AF$ is the acceleration factor, starting at $0.02$ and capped at $0.2$

  • $EP$ is the extreme point (highest high during an uptrend or lowest low during a downtrend)

Explanation of the features:

  • trend_psar_up: Contains PSAR values during uptrends (dots below price), otherwise NaN

  • trend_psar_down: Contains PSAR values during downtrends (dots above price), otherwise NaN

To simplify, I create a new feature column psar_trend with directional encoding that consolidates the information from these two features:

  • 1 for uptrend

  • -1 for downtrend

  • 0 for neutral

After generating psar_trend, the original two PSAR columns are removed to reduce sparsity and simplify the feature set.
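A minimal sketch of this consolidation, using hypothetical PSAR values:

```python
import numpy as np
import pandas as pd

# Hypothetical PSAR columns: a value during the active trend, NaN otherwise
df = pd.DataFrame({
    "trend_psar_up":   [np.nan, 101.2, 102.0, np.nan, np.nan],
    "trend_psar_down": [110.5,  np.nan, np.nan, 108.3, np.nan],
})

# 1 = uptrend, -1 = downtrend, 0 = neutral (neither column populated)
df["psar_trend"] = np.where(
    df["trend_psar_up"].notna(), 1,
    np.where(df["trend_psar_down"].notna(), -1, 0),
)

# Drop the sparse originals once the direction is captured
df = df.drop(columns=["trend_psar_up", "trend_psar_down"])

print(df["psar_trend"].tolist())
```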

3. Transform drift/non-stationary price-reflective features
Several generated features directly reflect price movement, inheriting drift and non-stationarity:

  • trend_sma_fast, trend_sma_slow, trend_ema_fast, trend_ema_slow

  • trend_ichimoku_conv, trend_ichimoku_base, trend_ichimoku_a, trend_ichimoku_b

  • trend_visual_ichimoku_a, trend_visual_ichimoku_b

Using standard or min-max scalers on them would cause data leakage, as future non-stationary values would affect the scaling. To address this, these features are converted into relative ratios by dividing them by the day's closing price. For example:
$$\text{trend_sma_fast_div_close} = \frac{\text{trend_sma_fast}}{\text{close}}$$

This transformation reduces drift, normalizes scale, and preserves interpretability without leakage. The original columns are then removed and replaced by the newly created ratios.
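A sketch of the ratio transformation, shown here for two hypothetical drift columns:

```python
import pandas as pd

# Hypothetical price-level trend columns alongside the close price
df = pd.DataFrame({
    "close":          [100.0, 102.0, 101.0],
    "trend_sma_fast": [ 99.0, 100.5, 101.5],
    "trend_ema_fast": [ 99.5, 101.0, 101.2],
})

drift_cols = ["trend_sma_fast", "trend_ema_fast"]
for col in drift_cols:
    # Ratio to the same day's close removes the common price drift
    df[f"{col}_div_close"] = df[col] / df["close"]

# Replace the originals with the ratio versions
df = df.drop(columns=drift_cols)
print(df.round(3))
```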

4. Add more features for daily variance
Finally, I manually include two basic range-based features to capture intraday price movement: $$\text{h-l} = \text{high} - \text{low}$$ $$\text{o-c} = \text{open} - \text{close}$$

After applying feature enrichment from add_all_ta_features and the refinements above, we now have 95 columns in the main dataset:

  • Original OHLCV data: open, high, low, close, volume

  • Return and target: return, target

  • Trigonometric weekday features: dsin, dcos

  • Volume features: volume_adi, volume_obv, volume_cmf, volume_fi, volume_em, volume_sma_em, volume_vpt, volume_vwap, volume_mfi, volume_nvi

  • Volatility features: volatility_bbm, volatility_bbh, volatility_bbl, volatility_bbw, volatility_bbp, volatility_bbhi, volatility_bbli, volatility_kcc, volatility_kch, volatility_kcl, volatility_kcw, volatility_kcp, volatility_kchi, volatility_kcli, volatility_dcl, volatility_dch, volatility_dcm, volatility_dcw, volatility_dcp, volatility_atr, volatility_ui

  • Trend features: trend_macd, trend_macd_signal, trend_macd_diff, trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_trix, trend_mass_index, trend_dpo, trend_kst, trend_kst_sig, trend_kst_diff, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_cci, trend_aroon_up, trend_aroon_down, trend_aroon_ind, trend_psar_up_indicator, trend_psar_down_indicator, psar_trend

  • Transformed trend features (drift removed): trend_sma_fast_div_close, trend_sma_slow_div_close, trend_ema_fast_div_close, trend_ema_slow_div_close, trend_ichimoku_conv_div_close, trend_ichimoku_base_div_close, trend_ichimoku_a_div_close, trend_ichimoku_b_div_close, trend_visual_ichimoku_a_div_close, trend_visual_ichimoku_b_div_close

  • Momentum features: momentum_rsi, momentum_stoch_rsi, momentum_stoch_rsi_k, momentum_stoch_rsi_d, momentum_tsi, momentum_uo, momentum_stoch, momentum_stoch_signal, momentum_wr, momentum_ao, momentum_roc, momentum_ppo, momentum_ppo_signal, momentum_ppo_hist, momentum_pvo, momentum_pvo_signal, momentum_pvo_hist, momentum_kama

  • Other return related features: others_dlr, others_cr

  • Daily variance features: h-l, o-c

All columns contain numeric values, either as floats or integers.

Back to Content

2.2.2 Features from other resources¶

Stock prices can be influenced by a variety of factors beyond trading data itself—such as investor sentiment reflected in news coverage, CDS spreads, and dividend history.

After thorough research, I found that Adobe’s CDS spread data is not publicly available, and the company has not issued any dividends since 2005. As a result, this section will focus on extracting sentiment signals from Adobe-related news headlines published between 1 Jan 2020 and 1 July 2025.

The news data was downloaded and processed in the DL Tuo Li CODE/DL Tuo Li Appendix_news_process.ipynb notebook. The processed news dataset is indexed by date, spanning from January 1, 2020 to July 1, 2025, aligning with the main dataset. It contains two columns that quantify news-driven signals for each calendar day:

  • news_sentiment_score: indicates whether the news titles expressed a positive or negative sentiment that could impact Adobe's price. The value lies between $-1$ and $1$: $-1$ means highly negative, $0$ means neutral, and $1$ means highly positive.

  • news_emotion_intensity: measures whether the titles' emotion is high or low. The value ranges from $0$ to $1$: $0$ means extremely low and $1$ means extremely high.

These metrics were generated by a large language model (LLM) that automatically analyzed daily Adobe news titles. If no relevant news was published on a particular day, that date is omitted from the index.

Here are the first 5 rows of this news dataset:

date news_sentiment_score news_emotion_intensity
2020-01-01 0.000 0.000
2020-01-03 0.450 0.580
2020-01-06 0.000 0.000
2020-01-09 0.150 0.400
2020-01-10 0.000 0.000

I then merge it with the main dataset, which now has 97 columns.
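The merge can be sketched as a left join on the date index. Filling the sentiment columns with $0$ (neutral) on news-free trading days is an assumption of this sketch, not necessarily the notebook's exact choice:

```python
import pandas as pd

# Toy main dataset (trading days) and news dataset (days with news only)
prices = pd.DataFrame(
    {"close": [334.43, 331.81, 333.71]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)
news = pd.DataFrame(
    {"news_sentiment_score": [0.45], "news_emotion_intensity": [0.58]},
    index=pd.to_datetime(["2020-01-03"]),
)

# Left join keeps every trading day; days without news get neutral values
merged = prices.join(news, how="left")
cols = ["news_sentiment_score", "news_emotion_intensity"]
merged[cols] = merged[cols].fillna(0.0)
print(merged)
```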

Back to Content

2.2.3 Features from macro environment¶

2.2.3.1 Features from QQQ¶

Invesco QQQ (ticker: QQQ) is an exchange-traded fund (ETF) that tracks the Nasdaq-100 Index, which includes 100 of the largest non-financial companies listed on the Nasdaq, such as Apple, Amazon, Adobe and Nvidia.

It is heavily weighted toward large-cap technology stocks, and often viewed as a proxy for the tech sector’s overall health. QQQ's price reflects aggregated investor sentiment toward high-growth, large-cap tech stocks.

QQQ can act as a sentiment barometer for Adobe’s ecosystem. Including QQQ data will improve the model’s ability to predict uptrend probabilities by incorporating macro and sector signals.

The QQQ trading data was also sourced from MacroTrends, with standard OHLCV information.

Below is the preview of the QQQ dataset with first 5 rows:

date open high low close volume weekday
2020-01-02 207.096 208.796 206.690 208.796 29958247 Thu
2020-01-03 206.023 208.129 206.014 206.883 26594637 Fri
2020-01-06 205.251 208.245 205.009 208.216 20986764 Mon
2020-01-07 208.293 208.776 207.530 208.187 22333269 Tue
2020-01-08 208.129 210.708 207.830 209.752 25562588 Wed

Statistical summary of the numerical columns:

count mean std min 25% 50% 75% max
open 1381 353.366 89.170 165.448 287.310 344.309 422.327 551.260
high 1381 356.212 89.304 168.633 289.908 346.220 426.196 552.800
low 1381 350.316 88.836 159.650 283.839 341.277 419.791 549.010
close 1381 353.462 89.109 163.532 286.743 343.896 422.211 551.640
volume 1381 48326455.122 21716833.598 15225092 33335184 44701851 57932013 194966806

In order to better leverage the information provided by QQQ trading data, we will also need to perform some feature engineering with this dataset.

First, generate a new feature qqq_adobe_corr_20 that captures the correlation between QQQ and Adobe daily returns. By calculating a 20-day rolling correlation, this feature reflects sector-level alignment and helps quantify how closely Adobe’s price movements track broader technology trends represented by QQQ.
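The rolling correlation can be computed directly with pandas; the snippet below uses synthetic return series in place of the actual ADBE and QQQ returns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical daily returns for Adobe and a correlated QQQ series
adbe = pd.Series(rng.normal(0, 0.02, 60))
qqq = 0.7 * adbe + pd.Series(rng.normal(0, 0.01, 60))

# 20-day rolling correlation of the two return series;
# the first 19 rows are NaN (not enough history for a full window)
qqq_adobe_corr_20 = adbe.rolling(20).corr(qqq)

print(qqq_adobe_corr_20.tail(3).round(3))
```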

Then, add technical features from QQQ's trading data. As QQQ's data is only supportive in this project, I won't add the full set of technical features. Instead, I will focus on several key ones:

  • 10-day SMA and 10-day EMA, each divided by the close price

  • ATR, Bollinger Bands, RSI, and MACD

  • H-L and O-C

Finally, select relevant QQQ features and merge them with the main dataset.

Below is a preview of the selected QQQ features with last 5 rows. For ease of display, it is split into two parts.

date qqq_volume qqq_adobe_corr_20 qqq_sma_10_div_close qqq_ema_10_div_close
2025-06-25 44804200 0.643 0.983 0.984
2025-06-26 43811400 0.635 0.977 0.980
2025-06-27 57577100 0.638 0.976 0.981
2025-06-30 45548700 0.661 0.974 0.979
2025-07-01 56166700 0.638 0.985 0.990
date qqq_atr qqq_bbands_l qqq_bbands_m qqq_bbands_u qqq_rsi qqq_macd qqq_h-l qqq_o-c
2025-06-25 6.799 520.018 533.445 546.872 62.291 8.254 3.930 0.900
2025-06-26 6.452 521.058 537.010 552.962 70.425 8.802 5.150 -2.870
2025-06-27 6.341 528.506 541.380 554.254 68.525 9.280 5.450 -0.830
2025-06-30 6.439 535.559 545.378 555.197 70.158 9.832 3.790 -0.380
2025-07-01 6.509 539.252 546.820 554.388 62.261 9.781 6.050 2.740

With 12 new QQQ features included, the main dataset now has 109 columns.

Back to Content

2.2.3.2 Features from macro economy¶

Next, the Federal Funds Effective Rate is likely to influence Adobe's share price movements, as the technology sector is highly sensitive to borrowing costs and liquidity, both of which are directly affected by rate fluctuations.

I sourced this monthly series from the Federal Reserve Bank of St. Louis. Below is a preview with the first 5 rows of the data.

observation_date FEDFUNDS
2020-01-01 1.550
2020-02-01 1.580
2020-03-01 0.650
2020-04-01 0.050
2020-05-01 0.050

As the federal rate data is on a monthly basis, we need to map it to our main dataset using a forward-filling technique, ensuring that in the merged dataset, each day's federal rate correctly aligns with the corresponding monthly value.
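A sketch of the forward-fill mapping, using the first few monthly observations and some hypothetical trading days:

```python
import pandas as pd

# Monthly FEDFUNDS observations (first of each month), from the preview above
fed = pd.Series(
    [1.55, 1.58, 0.65],
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    name="FEDFUNDS",
)

# Hypothetical trading days from the main dataset
days = pd.to_datetime(["2020-01-02", "2020-01-31", "2020-02-03", "2020-03-02"])

# Forward-fill: each trading day takes the most recent monthly value
daily_fed = fed.reindex(fed.index.union(days)).ffill().loc[days]
print(daily_fed.tolist())
```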

After incorporating the federal rate data, the feature generation is completed.

We now have 110 columns in the main dataset:

  • Original OHLCV data: open, high, low, close, volume

  • Return and target: return, target

  • Trigonometric weekday features: dsin, dcos

  • Volume features: volume_adi, volume_obv, volume_cmf, volume_fi, volume_em, volume_sma_em, volume_vpt, volume_vwap, volume_mfi, volume_nvi

  • Volatility features: volatility_bbm, volatility_bbh, volatility_bbl, volatility_bbw, volatility_bbp, volatility_bbhi, volatility_bbli, volatility_kcc, volatility_kch, volatility_kcl, volatility_kcw, volatility_kcp, volatility_kchi, volatility_kcli, volatility_dcl, volatility_dch, volatility_dcm, volatility_dcw, volatility_dcp, volatility_atr, volatility_ui

  • Trend features: trend_macd, trend_macd_signal, trend_macd_diff, trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_trix, trend_mass_index, trend_dpo, trend_kst, trend_kst_sig, trend_kst_diff, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_cci, trend_aroon_up, trend_aroon_down, trend_aroon_ind, trend_psar_up_indicator, trend_psar_down_indicator, psar_trend

  • Transformed trend features (drift removed): trend_sma_fast_div_close, trend_sma_slow_div_close, trend_ema_fast_div_close, trend_ema_slow_div_close, trend_ichimoku_conv_div_close, trend_ichimoku_base_div_close, trend_ichimoku_a_div_close, trend_ichimoku_b_div_close, trend_visual_ichimoku_a_div_close, trend_visual_ichimoku_b_div_close

  • Momentum features: momentum_rsi, momentum_stoch_rsi, momentum_stoch_rsi_k, momentum_stoch_rsi_d, momentum_tsi, momentum_uo, momentum_stoch, momentum_stoch_signal, momentum_wr, momentum_ao, momentum_roc, momentum_ppo, momentum_ppo_signal, momentum_ppo_hist, momentum_pvo, momentum_pvo_signal, momentum_pvo_hist, momentum_kama

  • Other return related features: others_dlr, others_cr

  • Daily variance features: h-l, o-c

  • News sentiment features: news_sentiment_score, news_emotion_intensity

  • QQQ related features: qqq_volume, qqq_adobe_corr_20, qqq_sma_10_div_close, qqq_ema_10_div_close, qqq_atr, qqq_bbands_l, qqq_bbands_m, qqq_bbands_u, qqq_rsi, qqq_macd, qqq_h-l, qqq_o-c

  • Federal funds rate: FEDFUNDS

All columns contain numeric values, either as floats or integers.

Back to Content

3. Exploratory Data Analysis (EDA)¶

3.1 Structural evaluation¶

3.1.1 Handle the missing values¶

All data were initially sourced without missing values. However, during feature generation, $NaN$ values were introduced by the nature of certain indicator formulas. These missing values occur in the initial rows of the dataset, where rolling-window calculations lack sufficient history.

To maintain data integrity, we removed all rows containing $NaN$ values. Prior to this cleanup, the dataset comprised 1381 rows; after removal, it contains 1310 rows. As a result, the dataset now begins on April 15, 2020.

3.1.2 Create the feature X and the target y¶

To prepare for exploratory data analysis and future deep learning model training, we begin by constructing the feature dataset X and target dataset y.

  • X contains all columns from the main dataset, excluding the non-stationary pricing fields (open, high, low, close) and the target variable. It includes 105 columns in total.

  • y corresponds to the target column and represents the prediction objective.

  • Both X and y consist of 1,310 rows.

Since X encompasses all engineered features, it will be the primary focus in the upcoming exploratory analysis.

Back to Content

3.1.3 Explore feature distribution and analyze outliers¶

3.1.3.1 Visualize the features¶

This section focuses on analyzing feature distributions to help with suitable scaler selection.

I’ll begin by visualizing each feature in X using histograms and KDE plots to guide the categorization and scaling strategy:

[Figures: 3_feature_visualization1, 3_feature_visualization2]

Back to Content

3.1.3.2 Group the features¶

Having visualized the 105 features, I will group them into 3 categories, each paired with an appropriate scaling method to ensure consistent preprocessing:

  • Bounded distribution: Indicators constrained to a fixed range, showing skewed, bimodal, or other patterns. Scaled using MinMaxScaler.

  • Normal distribution: Features with symmetric spread and bell-shaped histograms. Best suited for StandardScaler.

  • Skewed / long-tailed / clustered distribution: Features with heavy skew, long tails, or clustered spikes. These are sensitive to outliers and scaled using RobustScaler.

After reviewing the features and observing their distributions, I can reach some initial grouping decisions:

  1. Based on the definition and with reference to the plots, I identified clear candidates for bounded distribution.
    Bounded distribution:
    dsin, dcos, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_aroon_up, trend_aroon_down, trend_aroon_ind, trend_psar_up_indicator, trend_psar_down_indicator, momentum_rsi, momentum_stoch_rsi, momentum_stoch_rsi_k, momentum_stoch_rsi_d, momentum_tsi, momentum_uo, momentum_stoch, momentum_stoch_signal, momentum_wr, psar_trend, news_sentiment_score, news_emotion_intensity, qqq_rsi, FEDFUNDS

  2. From the plots, features with obvious skew, long-tails, or clusters can be spotted as well.
    Skewed / long-tailed / clustered distribution:
    volume, volume_adi, volume_obv, volume_cmf, volume_sma_em, volume_vpt, volume_vwap, volume_mfi, volume_nvi, volatility_bbm, volatility_bbh, volatility_bbl, volatility_bbw, volatility_bbp, volatility_bbhi, volatility_bbli, volatility_kcc, volatility_kch, volatility_kcl, volatility_kcw, volatility_kchi, volatility_kcli, volatility_dcl, volatility_dch, volatility_dcm, volatility_dcw, volatility_dcp, volatility_atr, volatility_ui, trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_mass_index, trend_cci, trend_ichimoku_conv_div_close, trend_ichimoku_base_div_close, trend_ichimoku_a_div_close, trend_ichimoku_b_div_close, trend_visual_ichimoku_a_div_close, trend_visual_ichimoku_b_div_close, momentum_ppo, momentum_ppo_signal, momentum_pvo, momentum_pvo_signal, momentum_kama, others_cr, h-l, qqq_volume, qqq_adobe_corr_20, qqq_atr, qqq_bbands_l, qqq_bbands_m, qqq_bbands_u, qqq_macd, qqq_h-l

  3. Outlier analysis will be used to finalize the categorization for the rest. These features appear approximately normal, but it is unclear whether their potential outliers may cause them to behave like long-tailed distributions.
    Features to be examined with outlier analysis:
    return, volume_fi, volume_em, volatility_kcp, trend_macd, trend_macd_signal, trend_macd_diff, trend_trix, trend_dpo, trend_kst, trend_kst_sig, trend_kst_diff, trend_sma_fast_div_close, trend_sma_slow_div_close, trend_ema_fast_div_close, trend_ema_slow_div_close, momentum_ao, momentum_roc, momentum_ppo_hist, momentum_pvo_hist, others_dlr, o-c, qqq_sma_10_div_close, qqq_ema_10_div_close, qqq_o-c

Outlier analysis:

For the 25 features that need outlier analysis, I will apply two complementary metrics: the skewness coefficient and the Interquartile Range (IQR).

  • Skewness coefficient
    It quantifies the asymmetry of the distribution:

    • $0$ : perfectly symmetric distribution

    • Positive : right-skewed

    • Negative : left-skewed

    Interpretation:

    • $|\text{Skewness}| < 0.5$ : fairly symmetric, typically fine for StandardScaler

    • $0.5 \le |\text{Skewness}| \le 1$ : moderately skewed

    • $|\text{Skewness}| > 1$ : heavily skewed, often requiring RobustScaler or transformations to reduce skew if needed

  • Interquartile Range (IQR)
    This method identifies statistical outliers by examining how far values deviate from the middle 50% of the data—offering a robust and non-parametric approach that doesn’t assume normality:

    • $IQR$ is the gap between the values of the 75th percentile (Q3) and the 25th percentile (Q1).

    • Values falling outside the range $[Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]$ are considered outliers.

    • If outliers account for less than 5% of the data, we consider the feature suitable for normal distribution treatment.

  • Decision logic
    A feature will be categorized as normally distributed if both conditions are met:

    • $|\text{Skewness}| < 0.5$

    • $\text{Outlier ratio} < 5\%$

    Otherwise, it will be classified as skewed / long-tailed and scaled accordingly.
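The decision rule above can be expressed as a small helper; the two synthetic series below stand in for real features:

```python
import numpy as np
import pandas as pd

def categorize(series: pd.Series) -> str:
    """Apply the skewness + IQR decision rule described above."""
    skew = series.skew()
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outlier_ratio = ((series < lo) | (series > hi)).mean()
    # Normal only if both the symmetry and outlier conditions hold
    if abs(skew) < 0.5 and outlier_ratio < 0.05:
        return "Normal"
    return "Skewed/Long-tailed"

rng = np.random.default_rng(42)
symmetric = pd.Series(rng.normal(0, 1, 1000))     # bell-shaped
heavy_tail = pd.Series(rng.lognormal(0, 1, 1000)) # heavily right-skewed

print(categorize(symmetric), categorize(heavy_tail))
```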

This rule will be used to finalize the feature lists above. To support the process, we generate the following for each of the 25 features under outlier analysis:

  • Histogram + KDE plots for shape and modality

  • Box plots to visualize outliers and IQR

  • Q-Q plots to assess normality alignment

  • A summary table with skewness, outlier ratio, and categorization

Below are the histogram, KDE plots, box plots and Q-Q plots of the features.

[Figures: 4_outlier_analysis1 through 4_outlier_analysis5]

Below is the summary table of this outlier analysis with skewness, outlier ratio, and categorization for the features.

Feature Skewness Outlier % Distribution
return -0.821 4.120 Skewed/Long-tailed
volume_fi -3.316 10.760 Skewed/Long-tailed
volume_em -0.897 5.730 Skewed/Long-tailed
volatility_kcp -0.441 2.060 Normal
trend_macd -0.233 0.310 Normal
trend_macd_signal -0.222 0.080 Normal
trend_macd_diff -0.270 0.990 Normal
trend_trix -0.281 1.150 Normal
trend_dpo 0.393 2.370 Normal
trend_kst -0.057 1.910 Normal
trend_kst_sig -0.050 1.760 Normal
trend_kst_diff 0.140 0.310 Normal
trend_sma_fast_div_close 0.976 3.590 Skewed/Long-tailed
trend_sma_slow_div_close 0.779 1.070 Skewed/Long-tailed
trend_ema_fast_div_close 0.958 2.900 Skewed/Long-tailed
trend_ema_slow_div_close 0.852 1.530 Skewed/Long-tailed
momentum_ao -0.280 0.920 Normal
momentum_roc -0.141 1.530 Normal
momentum_ppo_hist -0.115 1.220 Normal
momentum_pvo_hist 1.106 3.510 Skewed/Long-tailed
others_dlr -1.141 4.050 Skewed/Long-tailed
o-c 0.240 1.910 Normal
qqq_sma_10_div_close 0.764 2.140 Skewed/Long-tailed
qqq_ema_10_div_close 0.827 2.520 Skewed/Long-tailed
qqq_o-c -0.720 3.360 Skewed/Long-tailed

Based on this outlier analysis, we now have finalized categorization for all the 105 features:

Bounded distribution:
dsin, dcos, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_aroon_up, trend_aroon_down, trend_aroon_ind, trend_psar_up_indicator, trend_psar_down_indicator, momentum_rsi, momentum_stoch_rsi, momentum_stoch_rsi_k, momentum_stoch_rsi_d, momentum_tsi, momentum_uo, momentum_stoch, momentum_stoch_signal, momentum_wr, psar_trend, news_sentiment_score, news_emotion_intensity, qqq_rsi, FEDFUNDS

Skewed / long-tailed / clustered distribution:
volume, volume_adi, volume_obv, volume_cmf, volume_sma_em, volume_vpt, volume_vwap, volume_mfi, volume_nvi, volatility_bbm, volatility_bbh, volatility_bbl, volatility_bbw, volatility_bbp, volatility_bbhi, volatility_bbli, volatility_kcc, volatility_kch, volatility_kcl, volatility_kcw, volatility_kchi, volatility_kcli, volatility_dcl, volatility_dch, volatility_dcm, volatility_dcw, volatility_dcp, volatility_atr, volatility_ui, trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_mass_index, trend_cci, trend_ichimoku_conv_div_close, trend_ichimoku_base_div_close, trend_ichimoku_a_div_close, trend_ichimoku_b_div_close, trend_visual_ichimoku_a_div_close, trend_visual_ichimoku_b_div_close, momentum_ppo, momentum_ppo_signal, momentum_pvo, momentum_pvo_signal, momentum_kama, others_cr, h-l, qqq_volume, qqq_adobe_corr_20, qqq_atr, qqq_bbands_l, qqq_bbands_m, qqq_bbands_u, qqq_macd, qqq_h-l, return, volume_fi, volume_em, trend_sma_fast_div_close, trend_sma_slow_div_close, trend_ema_fast_div_close, trend_ema_slow_div_close, momentum_pvo_hist, others_dlr, qqq_sma_10_div_close, qqq_ema_10_div_close, qqq_o-c

Normal distribution:
volatility_kcp, trend_macd, trend_macd_signal, trend_macd_diff, trend_trix, trend_dpo, trend_kst, trend_kst_sig, trend_kst_diff, momentum_ao, momentum_roc, momentum_ppo_hist, o-c

Back to Content

3.1.3.3 Scale the features¶

We’ve completed the structural evaluation of all features by analyzing their distributions and categorizing them into three groups based on scaling suitability.

To close this section, we will scale the features according to the defined categories to ensure consistency in future observation, analysis and model training.

  • Features with normal distribution will be scaled with StandardScaler.

  • Features with bounded distribution will be scaled with MinMaxScaler.

  • Features with skewed / long-tailed / clustered distribution will be scaled with RobustScaler.

The scaled features are stored in a new dataset X_scaled. It has the same structure (105 columns and 1310 rows) as X, except all its values are properly scaled.
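The group-wise scaling described above can be sketched as follows. This is a minimal sketch: the column lists `normal_cols`, `bounded_cols`, and `skewed_cols` are placeholders for the categories finalized in the previous section, and the notebook may differ in detail.

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

def scale_by_group(X, normal_cols, bounded_cols, skewed_cols):
    """Scale each feature group with the scaler suited to its distribution."""
    X_scaled = X.copy()
    for cols, scaler in [
        (normal_cols, StandardScaler()),   # normal: zero mean, unit variance
        (bounded_cols, MinMaxScaler()),    # bounded: squeeze into [0, 1]
        (skewed_cols, RobustScaler()),     # skewed/long-tailed: median and IQR
    ]:
        if cols:
            X_scaled[cols] = scaler.fit_transform(X[cols])
    return X_scaled
```

Fitting and transforming in one call is acceptable here because this scaled frame is used only for exploratory analysis; for model training, scalers are re-fitted on the training split alone (see Section 4.1.1).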

Back to Content

3.2 SHAP analysis and feature relationship exploration¶

The goal of this section is to uncover relationships among features, providing insights into potential multicollinearity and dependencies. Given the large number of features (105), it is impractical to study the bilateral relationships for every possible pair.

To address this efficiently, I will first apply SHAP analysis to identify the 10 most impactful features for the prediction task. These top features will then be explored with pairwise scatter plots. The least relevant features will also be detected and then removed in this process.

3.2.1 SHAP analysis¶

SHAP analysis is a powerful method for interpreting machine learning models by quantifying the impact of each feature on a prediction. In this study, we apply SHAP to an XGBoost classifier, evaluating the influence of features in X_scaled on the target variable y.

XGBoost demonstrated strong performance during the evaluation of my Exam 3 project, which justifies its selection as the model for SHAP-based interpretability in this analysis.

Below is the visualization of the 20 most influential features from the SHAP analysis.

5_SHAP

The SHAP summary plot above ranks the top 20 features by their impact on the XGBoost classifier’s predictions, sorted from highest to lowest importance.

Each dot corresponds to a single day's observation. Its horizontal position reflects the feature’s influence on the model output: pushing the prediction toward class 1 (right) or toward class 0 (left). For example:

  • A red dot for return positioned on the left suggests that a higher daily return pushes the prediction toward class 0, reducing the likelihood of predicting class 1.

  • A blue dot for o-c located on the right indicates that a lower open-close price difference increases the probability of class 1.

Then, we calculate the mean absolute SHAP values as the indicator of each feature's importance.

A mean absolute SHAP value above $0.1$ typically indicates a feature with strong predictive impact. Among our 105 features, 41 meet this standard.

Conversely, features with mean absolute SHAP values below $0.01$ tend to contribute negligibly to the model and may represent statistical noise or redundancy. In our case, 11 features fall below this threshold.

The result indicates good feature construction: approximately 40% of the features exhibit strong influence, contributing meaningfully to model decisions, while only about 10% show minimal impact, suggesting limited predictive value.

Next, we will:

  • Extract the top 10 most impactful features for deeper relationship analysis.
  • Remove the 11 features with mean absolute SHAP values below 0.01, as they contribute minimally to model performance.
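These two steps reduce to a simple ranking once the SHAP values are available. The sketch below assumes they have already been computed (e.g., with `shap.TreeExplainer` on the fitted XGBoost classifier) as an `(n_samples, n_features)` array; the thresholds mirror the $0.1$ / $0.01$ cut-offs above.

```python
import numpy as np
import pandas as pd

def rank_by_shap(shap_values, feature_names, strong=0.1, noise=0.01):
    """Rank features by mean |SHAP| and flag strong vs. negligible ones.

    shap_values: array of shape (n_samples, n_features)."""
    importance = (
        pd.Series(np.abs(shap_values).mean(axis=0), index=feature_names)
        .sort_values(ascending=False)
    )
    strong_feats = importance[importance > strong].index.tolist()
    drop_feats = importance[importance < noise].index.tolist()
    return importance, strong_feats, drop_feats
```

The top-10 set is then `importance.head(10)`, and the negligible features are removed with `X_scaled.drop(columns=drop_feats)`.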

Below are the top 10 features and their mean absolute SHAP values. We will explore their relationships in the next section.

Feature Mean Abs SHAP
return 0.458
o-c 0.280
volume_adi 0.248
qqq_h-l 0.244
trend_vortex_ind_diff 0.240
trend_dpo 0.233
volatility_atr 0.220
h-l 0.209
volume_cmf 0.194
trend_vortex_ind_pos 0.185

Below are the 11 features whose mean absolute SHAP values are under $0.01$. They will be removed from the feature dataset X_scaled.

Feature Mean Abs SHAP
volatility_dcl 0.008
volatility_bbhi 0.000
others_dlr 0.000
volatility_dcm 0.000
volatility_kchi 0.000
volatility_kcli 0.000
trend_psar_down_indicator 0.000
volatility_bbli 0.000
psar_trend 0.000
trend_psar_up_indicator 0.000
momentum_wr 0.000

After the removal, we now have 94 features in X_scaled.

Back to Content

3.2.2 Explore relationships among high-impact features¶

To uncover potential dependencies and interaction patterns, I use multi-scatter plots (pairplots) on the top 10 most impactful features identified via SHAP analysis.

6_scatter_plot

As illustrated in the visualizations:

  • Many plots show feature values clustered along vertical lines, suggesting weak correlation between those pairs.

  • trend_vortex_ind_diff and trend_vortex_ind_pos exhibit a strong positive correlation, forming a pronounced diagonal across their scatter plot.

  • Moderate positive correlations are observed between:

    • trend_vortex_ind_pos and volume_cmf

    • trend_vortex_ind_diff and volume_cmf

    • h-l and volatility_atr

  • Moderate negative correlations appear between:

    • o-c and return

    • trend_vortex_ind_diff and trend_dpo

  • Most feature comparisons involving volume_adi produce two distinct clusters, indicating potential segmentation or bifurcation in behavior.

  • The relationship between o-c and h-l is context-dependent:

    • When o-c is negative, the correlation is negative

    • When o-c is positive, the correlation appears positive

These insights help pinpoint underlying dependencies and potential non-linear interactions within the most impactful features. I will document these findings for reference but hold off on taking any immediate action. These observations will be revisited during the multicollinearity analysis phase to inform further feature removal.

Back to Content

3.3 Analyze multi-collinearity and reduce dimensionality¶

3.3.1 VIF analysis and correlation heatmap¶

Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model, and it can be detected using the Variance Inflation Factor (VIF).

The VIF score of an independent variable measures how well that variable is explained by the other independent variables, and it is calculated by the following formula: $$ VIF = \frac{1}{1 - R^2} $$

Here $R^2$ comes from regressing the variable on all the other independent variables; a high $R^2$ means the variable is highly correlated with the rest.

VIF starts at $1$ (no correlation) and has no upper limit. In this case, we will remove highly correlated features, aiming to keep the remaining ones with VIF below $10$.
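The VIF definition above can be implemented directly by regressing each feature on all the others; this NumPy sketch mirrors what statsmodels' `variance_inflation_factor` computes.

```python
import numpy as np

def vif_scores(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    scores = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # design matrix with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        # perfectly collinear columns give R^2 = 1, hence infinite VIF
        scores.append(np.inf if np.isclose(r2, 1) else 1.0 / (1.0 - r2))
    return np.array(scores)
```

The infinite entries in the table below arise exactly this way: some indicators are exact linear combinations of others (e.g., a MACD histogram is MACD minus its signal line), so their auxiliary regression fits perfectly.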

We compute the VIF of the current 94 scaled features and list the 50 features with the highest VIF scores below:

Features VIF Score
trend_aroon_down inf
volatility_kch inf
trend_ichimoku_base_div_close inf
trend_ichimoku_conv_div_close inf
trend_macd inf
momentum_ppo_signal inf
trend_macd_diff inf
trend_vortex_ind_pos inf
trend_vortex_ind_neg inf
trend_vortex_ind_diff inf
trend_kst inf
trend_kst_sig inf
trend_kst_diff inf
trend_aroon_up inf
momentum_pvo_hist inf
trend_aroon_ind inf
momentum_pvo_signal inf
momentum_pvo inf
momentum_ppo_hist inf
volatility_kcl inf
trend_macd_signal inf
volatility_kcc inf
volatility_bbl inf
qqq_bbands_u inf
qqq_bbands_m inf
qqq_bbands_l inf
trend_ichimoku_a_div_close inf
volatility_bbm inf
volatility_bbh inf
momentum_ppo inf
trend_trix 9482.907
trend_ema_slow_div_close 8646.385
others_cr 5781.741
trend_ema_fast_div_close 3332.137
volume_vwap 2998.145
volatility_dch 1427.714
trend_sma_slow_div_close 960.215
trend_sma_fast_div_close 740.427
momentum_kama 615.726
momentum_ao 298.481
momentum_rsi 221.841
volatility_bbp 216.140
trend_visual_ichimoku_a_div_close 212.918
momentum_tsi 176.274
qqq_ema_10_div_close 130.436
trend_cci 110.998
qqq_sma_10_div_close 98.291
volatility_dcp 93.383
volume_vpt 88.051
momentum_stoch_rsi_k 83.440

We use a heatmap to visualize the correlations.

7_heatmap1

As shown in the table, a substantial portion of the feature set exhibits extremely high VIF scores, ranging from four-digit values to infinity, signaling deep and complex interdependencies.

This is supported by the dense, vibrant correlation heatmap, which highlights a tightly woven web of relationships across features: clear visual evidence of severe multicollinearity.

Next, we will reduce dimensionality to mitigate the high multicollinearity among the features.

Back to Content

3.3.2 Reduce dimensionality¶

To address feature multicollinearity and enhance model robustness, we will try two dimensionality reduction techniques:

  1. Cluster-based selection

    • Apply unsupervised machine learning (e.g., KMeans) to group features based on similarity.

    • Within each cluster, retain only the feature with the highest mean absolute SHAP value to represent the group.

  2. Correlation-based filtering

    • Identify highly correlated feature pairs across the dataset.

    • For each pair, remove the feature with the lower mean absolute SHAP value to reduce redundancy.

PCA is avoided to preserve interpretability. Additionally, PCA only captures linear dependencies, while our exploratory scatter plots show nonlinear relationships, limiting PCA’s effectiveness in preserving feature structure.

3.3.2.1 Cluster-based selection¶

To apply K-Means clustering on the features, we first need to construct a "feature-of-features" dataset that captures the characteristics of each feature.

Since all features are numeric, we will use the standard .describe() method to summarize their statistical properties, effectively reflecting the quantitative nature of each feature.

There will be 94 rows in this dataset, each representing one feature. Below is a preview of the first 5 rows:

Feature count mean std min 25% 50% 75% max
volume 1310.000 0.334 1.322 -1.461 -0.387 0.000 0.613 17.880
return 1310.000 -0.028 0.985 -7.339 -0.484 0.000 0.516 6.260
dsin 1310.000 0.578 0.382 0.000 0.277 0.723 0.901 1.000
dcos 1310.000 0.367 0.363 0.000 0.000 0.445 0.445 1.000
volume_adi 1310.000 -0.072 0.818 -3.181 -0.498 0.000 0.502 1.583

Next, we’ll apply the Elbow method to identify the optimal number of clusters that align with our objective. Specifically, we’ll group the 94 features into clusters ranging from 2 to 60 and examine how the relative inertia (the measure of within-cluster compactness) declines as the number of clusters increases.

8_elbow

The Elbow plot shows that relative inertia begins to plateau beyond 20 clusters, indicating that KMeans is approaching its limit in terms of meaningful compression. This suggests minimal gain from further increasing the number of clusters.
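The elbow search behind this plot can be sketched as follows; `profile` stands for the feature-of-features table built above.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(profile, k_range=range(2, 61), seed=42):
    """Inertia (within-cluster sum of squares) for each candidate k.

    profile: the feature-of-features matrix, one row per feature."""
    X = np.asarray(profile, dtype=float)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_range
    }
```

Plotting the returned dictionary (k on the x-axis, inertia on the y-axis) reproduces the elbow curve; the "elbow" is where the decline in inertia flattens.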

However, I am concerned that collapsing 94 features into only 20 clusters may risk discarding many potentially meaningful features. To balance compression with feature diversity, I choose to retain 30 clusters instead.

Therefore, we group the features into 30 clusters and, within each cluster, select the feature with the highest mean absolute SHAP value as its representative.
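A minimal sketch of this cluster-then-pick step; `profile` (the feature-of-features table) and `shap_importance` (a Series of mean absolute SHAP values indexed by feature name) are assumed to be available from the earlier steps.

```python
import pandas as pd
from sklearn.cluster import KMeans

def cluster_representatives(profile, shap_importance, n_clusters=30, seed=42):
    """KMeans-cluster the features, then keep the highest-|SHAP| feature per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(profile)
    members = pd.Series(labels, index=profile.index)
    reps = [
        shap_importance.loc[members[members == c].index].idxmax()
        for c in range(n_clusters)
    ]
    return sorted(reps)
```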

With the 30 representative features in hand, let's review the VIF scores they would have if kept together as one feature set:

Features VIF Score
momentum_pvo 17808.064
momentum_pvo_signal 11406.629
momentum_pvo_hist 4729.073
trend_vortex_ind_diff 40.739
trend_vortex_ind_pos 35.398
trend_ichimoku_base_div_close 13.097
momentum_roc 10.804
volume_obv 7.978
volume_vwap 7.640
trend_adx_neg 7.230
trend_kst 6.268
volume_sma_em 6.024
volume_vpt 5.005
volume_fi 4.513
momentum_stoch_rsi 4.509
volume_adi 4.097
return 4.018
o-c 3.601
volume 3.524
qqq_volume 3.296
qqq_sma_10_div_close 3.067
volatility_atr 3.021
qqq_h-l 2.712
h-l 2.620
qqq_adobe_corr_20 2.334
qqq_o-c 2.318
volume_em 2.180
volatility_bbw 2.100
trend_dpo 2.074
trend_mass_index 1.948

As shown in the table above, while the cluster-based selection helps reduce multicollinearity by removing some features, it also comes with several limitations:

  • Super-high multicollinearity remains: Some features still exhibit VIF scores in the 5-digit range, which is highly alarming.

  • Loss of feature diversity: Key macroeconomic indicators like FEDFUNDS, sentiment-related features, and weekday-based features are excluded during selection. This reduction in feature variety is concerning.

  • No clear direction for cluster adjustment: I don't suggest adjusting the number of clusters as a solution, as increasing clusters may worsen multicollinearity, while decreasing them could further reduce feature diversity.

Given these concerns, I’ll pause further action and explore correlation-based feature selection instead.

Back to Content

3.3.2.2 Correlation-based selection¶

This approach is more straightforward: identify pairs of highly correlated features from the correlation matrix, then remove the one with the lower mean absolute SHAP value in each pair. This strategy balances redundancy reduction with preservation of predictive impact.

First, we set the threshold at $0.85$ and filter out the feature pairs with correlation coefficients above it.

As a result, there are 201 feature pairs with correlation coefficients exceeding $0.85$, indicating a dense web of interrelationships. This high degree of multicollinearity suggests that removing even a single feature from a pair may lead to a notable reduction in VIF scores across multiple features.

Then, we examine each highly correlated pair and select the feature with the lower mean absolute SHAP value as a candidate for removal. Based on this process, 49 features are flagged for removal, leaving 45 features.
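The pair-filtering logic can be sketched as follows; `shap_importance` is again a placeholder for the mean absolute SHAP Series computed earlier.

```python
import pandas as pd

def correlation_filter(X, shap_importance, threshold=0.85):
    """Drop the lower-|SHAP| feature of every pair correlated above threshold."""
    corr = X.corr().abs()
    cols = corr.columns
    to_drop = set()
    # scan the upper triangle so each pair is examined once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                a, b = cols[i], cols[j]
                to_drop.add(a if shap_importance[a] < shap_importance[b] else b)
    return X.drop(columns=sorted(to_drop))
```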

Now, we reassess the 45 remaining features to evaluate whether multicollinearity is sufficiently mitigated, and observe how the VIF scores change.

Features VIF Score
momentum_uo 41.073
volume_vpt 41.071
trend_adx_neg 40.871
trend_adx_pos 35.283
qqq_rsi 34.294
FEDFUNDS 34.048
volume_vwap 20.591
momentum_stoch_rsi 18.633
qqq_bbands_u 16.832
volatility_ui 15.668
trend_kst 13.643
momentum_roc 13.542
volume_obv 11.244
trend_ichimoku_conv_div_close 11.164
trend_stc 10.569
trend_vortex_ind_diff 10.361
volatility_atr 10.116
volatility_kcw 9.995
trend_cci 9.498
trend_aroon_up 9.246
qqq_macd 8.733
trend_kst_diff 8.149
trend_adx 7.869
qqq_atr 7.281
volume_fi 6.624
qqq_sma_10_div_close 5.396
momentum_pvo 5.268
volume_adi 4.883
volatility_dcw 4.660
momentum_pvo_hist 4.406
return 4.271
volume 4.241
dsin 3.910
o-c 3.894
qqq_volume 3.856
qqq_h-l 3.483
volume_cmf 3.347
qqq_adobe_corr_20 3.192
h-l 2.920
trend_mass_index 2.715
qqq_o-c 2.518
dcos 2.342
volume_em 2.339
trend_dpo 2.207
news_emotion_intensity 2.141

This result is more acceptable than that of the cluster-based selection:

  • Multicollinearity is significantly reduced: the highest VIF score is now around 40. While VIF scores between 10 and 40 still suggest moderate collinearity, they are unlikely to substantially affect the model we will build, as LSTM neural networks are less sensitive to multicollinearity than linear models.

  • Feature diversity is preserved, including:

    • news_emotion_intensity for sentiment analysis

    • dcos for weekday patterns

    • FEDFUNDS for macroeconomic signals

Since the correlation-based filtering performs well, I will retain this refined feature set for modeling.

Check the correlation heatmap of the updated features.

9_heatmap2

The successful reduction of collinearity is also evident in the more desaturated heatmap.

This concludes the feature selection process. Below is the final list of 45 features we'll use to move forward with the project:

  • Original OHLCV data: volume

  • Return: return

  • Trigonometric weekday features: dsin, dcos

  • Volume features: volume_adi, volume_obv, volume_cmf, volume_fi, volume_em, volume_vpt, volume_vwap

  • Volatility features: volatility_kcw, volatility_dcw, volatility_atr, volatility_ui

  • Trend features: trend_vortex_ind_diff, trend_mass_index, trend_dpo, trend_kst, trend_kst_diff, trend_stc, trend_adx, trend_adx_pos, trend_adx_neg, trend_cci, trend_aroon_up

  • Transformed trend features (drift removed): trend_ichimoku_conv_div_close

  • Momentum features: momentum_stoch_rsi, momentum_uo, momentum_roc, momentum_pvo, momentum_pvo_hist

  • Daily variance features: h-l, o-c

  • News sentiment features: news_emotion_intensity

  • QQQ related features: qqq_volume, qqq_adobe_corr_20, qqq_sma_10_div_close, qqq_atr, qqq_bbands_u, qqq_rsi, qqq_macd, qqq_h-l, qqq_o-c

  • Federal funds rate: FEDFUNDS

Next, we proceed to model building with our well-prepared feature set.

Back to Content

4. Model Building¶

Model building is the core of this project. In this chapter, we will explore several LSTM architectures and fine-tune them to identify the optimal structure and hyperparameter configuration for predicting Adobe's next-day upward movement.

Below are some key considerations for this model building exercise:

  1. Architecture exploration

    We concentrate on LSTM models with 2 or 3 layers, with or without dropout:

    • Baseline model: 2-layer LSTM without dropout

    • Additional variants:

      • A: 2-layer LSTM with dropout

      • B: 3-layer LSTM without dropout

      • C: 3-layer LSTM with dropout

    • All models include a final Dense output layer with sigmoid activation for binary classification.

    • Models are evaluated using the metrics of AUC, F1 Score, Accuracy, Recall and Precision.

    • Due to the limited sample size (n = 1310), architectures with 4 or more LSTM layers are excluded to avoid overfitting.

  2. Hyperparameter optimization

    For each candidate architecture, we will use Bayesian Optimization to tune:

    • Number of LSTM units per layer: $5$ to $25$, in steps of $5$

    • Dropout rates (where applicable): $[0.3, 0.4, 0.5, 0.6]$

    • Learning rate: $[0.0005, 0.001, 0.002]$

    • Activation functions: $[\text{'elu'}, \text{'relu'}]$

  3. Other key hyperparameter design choices:

    • Optimization objective: maximize accuracy on the validation set (val_accuracy). Maximizing validation AUC often leads to zero class 1 predictions in this project due to the balanced targets.

    • Training epochs: $200$

    • Early stopping patience: $20$

    • Optimizer: Adam

  4. Reproducibility considerations:

    • Due to TensorFlow’s non‑determinism (LSTM kernels, GPU parallelism) and the stochastic nature of Bayesian search, exact reproducibility of tuning results is not guaranteed.

    • To mitigate variance, each architecture’s tuning will be run three times, and the best result from those runs will be saved to represent that model.

    • Although individual runs vary, performance across runs for a given structure should fall within a consistent range, allowing meaningful comparisons.

  5. Others:

    • Training progress is logged using TensorBoard for transparent tracking.

    • Among the 4 optimized models, the one with the best evaluation results on the testing set will be selected for backtesting in the next chapter.

Back to Content

4.1 Prepare dataset¶

4.1.1 Split and scale the dataset¶

To ensure a leak-free modeling pipeline, we begin by reverting to the unscaled version of the dataset. From there:

  • Restrict the dataset to the 45 features finalized in previous EDA phases.

  • Split this unscaled data into training ($70\%$), validation ($15\%$) and testing sets ($15\%$), preserving chronological order for time-series integrity.

  • Fit scaling transformations (StandardScaler, MinMaxScaler, RobustScaler for respective features as appropriate) only on the training set to prevent future data (validation and testing sets) from influencing learned parameters.

  • Apply the fitted scalers to the validation and testing sets to maintain consistency.

This approach preserves temporal structure and statistical independence across splits, laying a clean foundation for reliable LSTM training.
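A minimal sketch of the split-then-scale pipeline; for brevity it applies a single RobustScaler, whereas the notebook fits three different scalers per feature group.

```python
from sklearn.preprocessing import RobustScaler

def chronological_split_scale(X, train_pct=70, val_pct=15):
    """Chronological 70/15/15 split; the scaler is fitted on the training
    slice only, so no future statistics leak into the fitted parameters."""
    n = len(X)
    i_train = n * train_pct // 100               # 1310 rows -> index 917
    i_val = n * (train_pct + val_pct) // 100     # 1310 rows -> index 1113
    X_train, X_val, X_test = X[:i_train], X[i_train:i_val], X[i_val:]
    scaler = RobustScaler().fit(X_train)         # fit on train only
    return (scaler.transform(X_train),
            scaler.transform(X_val),
            scaler.transform(X_test))
```

Integer percentages are used for the cut points so the indices are exact, avoiding floating-point surprises like `int(1310 * 0.7)` evaluating to 916.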

We now have the following datasets prepared for training and evaluation:

  • Training set: X_train_scaled ($917$ samples, $45$ features), y_train ($917$ target labels)

  • Validation set: X_val_scaled ($196$ samples, $45$ features), y_val ($196$ target labels)

  • Testing set: X_test_scaled ($197$ samples, $45$ features), y_test ($197$ target labels)

The datasets are now fully prepared for LSTM model building.

Back to Content

4.1.2 Create data generator¶

When building an LSTM model, a critical design choice is selecting an appropriate sequence length: the number of consecutive past days used as input for predicting the following day.

For a technology stock like Adobe, a span of approximately one month ($20–22$ trading days) offers a practical balance:

  • Captures medium-term momentum and volatility trends

  • Preserves recent signal strength

  • Avoids overextending input length, which could dilute temporal relevance and increase model complexity

Based on this reasoning, we set the sequence length to $21$ days.

Meanwhile, we use TimeseriesGenerator from Keras to efficiently generate batches of sequential data for time-series modeling.

A batch refers to a group of input sequences of features and their corresponding target values processed together during training or evaluation. Batching improves computational efficiency and allows for smoother gradient updates during optimization.

Here, the batch size is set to $32$, balancing training stability and computational efficiency given the moderate dataset size. Note that the final batch of a generator may contain fewer than $32$ sequences.

Since each 21-day window yields one sample, a set with $n$ rows produces $n - 21$ sequences, giving three generators:

  • g_train: $(917-21)/32 = 28$ batches

  • g_val: $\lceil (196-21)/32 \rceil = 6$ batches

  • g_test: $\lceil (197-21)/32 \rceil = 6$ batches

Each full batch contains:

  • $32$ (batch size) feature sequences, each representing the most recent $21$ (sequence length) trading days

  • Each day includes $45$ features, resulting in a feature batch shape of $(32, 21, 45)$

  • $32$ corresponding target values, each indicating whether the stock price moved up on the day following the 21-day window
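The windowing arithmetic above can be verified with a plain NumPy sketch that mimics what TimeseriesGenerator yields (the real generator produces the same shapes lazily).

```python
import numpy as np

def make_batches(X, y, seq_len=21, batch_size=32):
    """Yield (features, targets) batches in the style of Keras'
    TimeseriesGenerator: each sample is the seq_len days preceding
    its target day, so n rows yield n - seq_len sequences."""
    seqs = np.stack([X[i - seq_len:i] for i in range(seq_len, len(X))])
    targets = y[seq_len:]
    for start in range(0, len(seqs), batch_size):
        yield seqs[start:start + batch_size], targets[start:start + batch_size]
```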

Back to Content

4.2 Baseline model - 2 layer LSTM model without dropout¶

4.2.1 Build the model (2 layers without dropout)¶

After structuring the architecture and tuning the hyperparameters within the planned search space, the optimized model has the following specifications:

  • Layer 1 units : $20$

  • Layer 2 units : $5$

  • Learning rate : $0.002$

  • Activation (layer 1) : relu

  • Activation (layer 2) : relu

The illustration below depicts the finalized model structure.

10_2layer_no_dropout

Screenshots of key visuals on TensorBoard results (TIME SERIES tab):

epoch_accuracy
11_2layer_no_dropout_accuracy

epoch_auc
11_2layer_no_dropout_auc

epoch_loss
11_2layer_no_dropout_loss

Interpretation:

  • In the epoch_accuracy plot, most hyperparameter trials showed increasing validation accuracy, roughly converging into 3 clusters $[0.5, 0.56, 0.63]$ by epoch 25. The best-performing run achieved a validation accuracy of ~$0.68$, indicating solid learning progress. This reflects effective hyperparameter tuning.

  • The epoch_auc plot shows that most model configurations gradually improved their validation AUC over the training epochs. Although the AUCs of some trials started around or below $0.5$ (random guessing), many reached above $0.65$, and the best delivered $0.74$, indicating good discriminative power accumulated during training.

  • The epoch_loss panel shows the binary cross-entropy loss over training epochs for different hyperparameter configurations. Approximately half of the trials showed steadily decreasing loss values, indicating effective learning. Other trials exhibited oscillating loss curves and triggered early stopping quickly, suggesting less effective hyperparameter settings.

  • Generally speaking, most trials finished learning within roughly 30 epochs with improved accuracy and AUC and decreased loss. The optimized model from this training run should demonstrate value in the later evaluation.

Back to Content

4.2.2 Evaluate the model (2 layers without dropout)¶

4.2.2.1 Evaluate the training data against the testing data¶

The trained model was evaluated on both the training and testing datasets using g_train and g_test, respectively. The observed accuracies are:

  • Training accuracy: $0.63$
  • Testing accuracy: $0.55$

4.2.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis¶

The model's performance on the testing dataset g_test is detailed below, including ROC Curve, Confusion Matrix and Classification Report:

12_2layer_no_dropout_roc

13_2layer_no_dropout_confusion_matrix

Classification Report - 2 Layers without Dropout
Class Precision Recall F1-Score Support
0 0.56 0.70 0.62 92
1 0.54 0.39 0.46 84
Accuracy 0.55 176
Macro Avg 0.55 0.54 0.54 176
Weighted Avg 0.55 0.55 0.54 176

4.2.2.3 Baseline model evaluation summary¶

Training Metrics

  • Accuracy: $0.63$

Testing Metrics

  • Accuracy: $0.55$

  • AUC: $0.52$

  • Class 1 precision: $0.54$

  • Class 1 recall: $0.39$

  • Class 1 F1 score: $0.46$

Interpretation

  • Moderate learning without extreme overfitting: Training accuracy of $0.63$ suggests the model learned moderately, while the testing accuracy of $0.55$ indicates limited generalization with some room for performance gains.

  • Limited discriminative power: AUC of $0.52$ shows near-random separation capability.

  • Weak signal capture: precision ($0.54$), recall ($0.39$), and F1 score ($0.46$) suggest the model struggles to identify upward trends.

  • It’s an acceptable starting point for iterative refinement. In our further efforts, an additional layer may improve signal capture.

Back to Content

4.3 Variant model A - 2 layer LSTM model with dropout¶

4.3.1 Build the model (2 layers with dropout)¶

After structuring the architecture and tuning the hyperparameters within the planned search space, the optimized model has the following specifications:

  • Layer 1 units : $5$

  • Layer 2 units : $5$

  • Dropout rate after layer 1 : $0.5$

  • Learning rate : $0.002$

  • Activation (layer 1) : relu

  • Activation (layer 2) : relu

The illustration below depicts the finalized model structure.

14_2layer_dropout

Screenshots of key visuals on TensorBoard results (TIME SERIES tab):

epoch_accuracy
15_2layer_dropout_accuracy

epoch_auc
15_2layer_dropout_auc

epoch_loss
15_2layer_dropout_loss

Interpretation:

  • In the epoch_accuracy plot, most hyperparameter trials showed increasing validation accuracy, and more trials had a clear upward tendency than in the baseline model. However, the best-performing run achieved a validation accuracy of ~$0.66$, below the baseline model. This suggests that adding dropout improves the chance of better performance, but does not necessarily push the upper limit of accuracy in our project.

  • The epoch_auc plot shows that most model configurations gradually improved their validation AUC over training epochs. Similar to the accuracy plot, more trials demonstrated improvement over the epochs than in the baseline model, yet the best AUC of $0.72$ did not exceed the baseline model's record.

  • The epoch_loss plot also shows two groups: half of the trials had steadily decreasing loss values, indicating effective learning, while the others exhibited oscillating loss curves. Compared to the baseline model, where all trials finished within around $30$ epochs, many trials here with decreasing loss took more epochs (~$60$) to reach early stopping.

  • Generally, this model with dropout requires more training epochs before meeting the early-stopping criterion based on loss reduction. While most trials outperformed those from the baseline model, the top-performing trial did not surpass its strongest baseline counterpart.

Back to Content

4.3.2 Evaluate the model (2 layers with dropout)¶

4.3.2.1 Evaluate the training data against the testing data¶

The trained model was evaluated on both the training and testing datasets using g_train and g_test, respectively. The observed accuracies are:

  • Training accuracy: $0.58$
  • Testing accuracy: $0.51$

4.3.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis¶

The model's performance on the testing dataset g_test is detailed below, including ROC Curve, Confusion Matrix and Classification Report:

16_2layer_dropout_roc

17_2layer_dropout_confusion_matrix

Classification Report - 2 Layers with Dropout
Class Precision Recall F1-Score Support
0 0.55 0.37 0.44 92
1 0.49 0.67 0.57 84
Accuracy 0.51 176
Macro Avg 0.52 0.52 0.50 176
Weighted Avg 0.52 0.51 0.50 176

Back to Content

4.3.2.3 Model evaluation summary¶

Training Metrics

  • Accuracy: $0.58$

Testing Metrics

  • Accuracy: $0.51$

  • Class 1 precision: $0.49$

  • Class 1 recall: $0.67$

  • Class 1 F1 score: $0.57$

  • AUC: $0.51$

Interpretation

  • Moderate learning without extreme overfitting: Training accuracy of $0.58$ suggests the model learned moderately, while the testing accuracy of $0.51$ indicates limited generalization with some room for performance gains.

  • Aggressive positive detection: High recall for class 1 ($0.67$) paired with moderate precision ($0.49$) shows the model favors identifying positives, possibly at the cost of false alarms.

  • Limited discriminative power: AUC of $0.51$ shows the model still performs barely better than random guessing.

  • While signal capture improved over the baseline model, it came somewhat at the cost of class 1 precision. We will next explore whether an additional layer can improve the overall performance.

Back to Content

4.4 Variant model B - 3 layer LSTM model without dropout¶

4.4.1 Build the model (3 layers without dropout)¶

After structuring the architecture and tuning the hyperparameters within the planned search space, the optimized model has the following specifications:

  • Layer 1 units : $15$

  • Layer 2 units : $25$

  • Layer 3 units : $5$

  • Learning rate : $0.002$

  • Activation (layer 1): elu

  • Activation (layer 2): elu

  • Activation (layer 3): relu

The illustration below depicts the finalized model structure.

18_3layer_no_dropout

Screenshots of key visuals on TensorBoard results (TIME SERIES tab):

epoch_accuracy
19_3layer_no_dropout_accuracy

epoch_auc
19_3layer_no_dropout_auc

epoch_loss
19_3layer_no_dropout_loss

Interpretation:

  • In the epoch_accuracy plot, most hyperparameter trials showed increasing validation accuracy, demonstrating patterns similar to the baseline model except for requiring more epochs. The best-performing run also achieved a validation accuracy of ~$0.68$, similar to the best result from the baseline model.

  • The epoch_auc plot shows that most model configurations gradually improved their validation AUC over training epochs. Although the AUCs of some trials started around or below $0.5$ (random guessing), the best trial delivered an AUC of ~$0.75$, very close to that of the baseline model.

  • The epoch_loss plot also demonstrates a high degree of similarity to its baseline counterpart. Approximately half of the trials showed steadily decreasing loss values, indicating effective learning, while the others exhibited oscillating loss curves, suggesting less effective hyperparameter settings.

  • Overall, compared to the baseline model, adding an extra layer had limited impact on the training experience. The only notable change was that the trials required more epochs to reach early stopping.

Back to Content

4.4.2 Evaluate the model (3 layers without dropout)¶

4.4.2.1 Evaluate the training data against the testing data¶

The trained model was evaluated on both the training and testing datasets using g_train and g_test, respectively. The observed accuracies are:

  • Training accuracy: $0.60$
  • Testing accuracy: $0.52$

4.4.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis¶

The model's performance on the testing dataset g_test is detailed below, including ROC Curve, Confusion Matrix and Classification Report:

20_3layer_no_dropout_roc

21_3layer_no_dropout_confusion_matrix

Classification Report - 3 Layers without Dropout

| Class        | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.53      | 0.73   | 0.61     | 92      |
| 1            | 0.50      | 0.30   | 0.37     | 84      |
| Accuracy     |           |        | 0.52     | 176     |
| Macro Avg    | 0.52      | 0.51   | 0.49     | 176     |
| Weighted Avg | 0.52      | 0.52   | 0.50     | 176     |
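
A report of this shape can be produced from the model's predicted probabilities with scikit-learn. The sketch below uses synthetic placeholder arrays (`y_test`, `y_prob`) in place of the notebook's real labels and model outputs.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=176)     # placeholder true labels (0/1)
y_prob = rng.random(176)                  # placeholder sigmoid outputs
y_pred = (y_prob >= 0.5).astype(int)      # threshold probabilities at 0.5

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=2))
print("AUC:", roc_auc_score(y_test, y_prob))
```

Note that the AUC is computed from the raw probabilities, while precision/recall/F1 depend on the chosen 0.5 threshold.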

Back to Content

4.4.2.3 Model evaluation summary¶

Training Metrics

  • Accuracy: $0.60$

Testing Metrics

  • Accuracy: $0.52$

  • Class 1 precision: $0.50$

  • Class 1 recall: $0.30$

  • Class 1 F1 score: $0.37$

  • AUC: $0.52$

Interpretation

  • Moderate learning without extreme overfitting: Training accuracy of $0.60$ suggests the model learned moderately, while testing accuracy of $0.52$ indicates limited predictive power.

  • Conservative class 1 prediction: Moderate precision ($0.50$) with low recall ($0.30$) suggests the model predicts positives cautiously, missing many true positives.

  • Limited discriminative power: AUC of $0.52$ indicates the model's ability to distinguish between classes remains close to chance level.

  • Adding another layer does not improve the model's performance; even the 2-layer model with dropout demonstrates slightly more value.

Back to Content

4.5 Variant model C - 3 layer LSTM model with dropout¶

4.5.1 Build the model (3 layers with dropout)¶

After structuring the architecture and tuning the hyperparameters within the planned search space, the optimized model has the following specifications:

  • Layer 1 units : $15$

  • Layer 2 units : $10$

  • Layer 3 units : $15$

  • Dropout rate after layer 1 : $0.5$

  • Dropout rate after layer 2 : $0.5$

  • Learning rate : $0.002$

  • Activation (layer 1): elu

  • Activation (layer 2): relu

  • Activation (layer 3): relu

The illustration below depicts the finalized model structure.

22_3layer_dropout

Screenshots of key visuals on TensorBoard results (TIME SERIES tab):

epoch_accuracy
23_3layer_dropout_accuracy

epoch_auc
23_3layer_dropout_auc

epoch_loss
23_3layer_dropout_loss

Interpretation:

  • In the epoch_accuracy plot, most hyperparameter trials showed increasing validation accuracy. Yet the starting accuracy was quite low across all trials, and the best accuracy of $0.64$ was the worst among all the models.

  • The epoch_auc plot shows that most model configurations gradually improved their validation AUC over training epochs. Although the AUC improvement followed trends similar to the other models, the best value of $0.71$ was still the lowest.

  • The epoch_loss panel also reveals two distinct patterns: roughly half of the trials show consistently decreasing loss, suggesting effective learning, while the others display oscillating or upward-trending loss curves — an undesirable behavior not observed in other models. The lowest recorded loss was $0.62$, again the worst among all the models. Most trials reached early stopping near epoch 47, with only two outliers extending to epoch 62.

  • With the extra layer and dropout, this model delivered worse results despite training for more epochs, suggesting that, for this project, additional neurons combined with regularization are not a good recipe for improving performance.

Back to Content

4.5.2 Evaluate the model (3 layers with dropout)¶

4.5.2.1 Evaluate the training data against the testing data¶

The trained model was evaluated on both the training and testing datasets using g_train and g_test, respectively. The observed accuracies are:

  • Training accuracy: $0.53$
  • Testing accuracy: $0.53$

4.5.2.2 Use the model to generate predictions on the testing data and conduct a more comprehensive performance analysis¶

The model's performance on the testing dataset g_test is detailed below, including ROC Curve, Confusion Matrix and Classification Report:

24_3layer_dropout_roc

25_3layer_dropout_confusion_matrix

Classification Report - 3 Layers with Dropout

| Class        | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.54      | 0.73   | 0.62     | 92      |
| 1            | 0.52      | 0.32   | 0.40     | 84      |
| Accuracy     |           |        | 0.53     | 176     |
| Macro Avg    | 0.53      | 0.52   | 0.51     | 176     |
| Weighted Avg | 0.53      | 0.53   | 0.51     | 176     |

Back to Content

4.5.2.3 Model evaluation summary¶

Training Metrics

  • Accuracy: $0.53$

Testing Metrics

  • Accuracy: $0.53$

  • Class 1 precision: $0.52$

  • Class 1 recall: $0.32$

  • Class 1 F1 score: $0.40$

  • AUC: $0.55$

Interpretation

  • Moderate learning with consistency: Training and testing accuracy are both $0.53$, indicating no overfitting. However, the lower training accuracy suggests the model learns less than the previous ones.

  • Moderate precision, low recall for class 1: The model predicts positives fairly accurately (precision: $0.52$) but misses many actual positives (recall: $0.32$), pulling the F1 score down to $0.40$.

  • Limited discriminative power: AUC of $0.55$ shows some improvement in distinguishing between classes but remains underwhelming.

  • While dropout helps reduce overfitting and enhances discriminative power, the overall accuracy shows no improvement, making it difficult to consider this model the best.

Back to Content

4.6 Review of all the models¶

Let's collect all the performance data and put it together in a dataframe for a clear comparison.

| Model                                      | training accuracy | testing accuracy | train/test accuracy gap | AUC   | class 1 precision | class 1 recall | class 1 F1 score |
|--------------------------------------------|-------------------|------------------|-------------------------|-------|-------------------|----------------|------------------|
| 2 layers without dropout (baseline model)  | 0.630             | 0.550            | 0.080                   | 0.520 | 0.540             | 0.390          | 0.460            |
| 2 layers with dropout (variant model A)    | 0.580             | 0.510            | 0.070                   | 0.510 | 0.490             | 0.670          | 0.570            |
| 3 layers without dropout (variant model B) | 0.600             | 0.520            | 0.080                   | 0.520 | 0.500             | 0.300          | 0.370            |
| 3 layers with dropout (variant model C)    | 0.530             | 0.530            | 0.000                   | 0.550 | 0.520             | 0.320          | 0.400            |
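
Assembling the comparison is a one-liner with pandas once the metrics are collected; a sketch with the values copied from the summaries above:

```python
import pandas as pd

# Metric values transcribed from each model's evaluation summary.
metrics = pd.DataFrame(
    {
        "training accuracy": [0.63, 0.58, 0.60, 0.53],
        "testing accuracy":  [0.55, 0.51, 0.52, 0.53],
        "AUC":               [0.52, 0.51, 0.52, 0.55],
        "class 1 precision": [0.54, 0.49, 0.50, 0.52],
        "class 1 recall":    [0.39, 0.67, 0.30, 0.32],
        "class 1 F1 score":  [0.46, 0.57, 0.37, 0.40],
    },
    index=[
        "2 layers without dropout (baseline)",
        "2 layers with dropout (variant A)",
        "3 layers without dropout (variant B)",
        "3 layers with dropout (variant C)",
    ],
)
# The overfitting gap is derived rather than stored.
metrics["train/test accuracy gap"] = (
    metrics["training accuracy"] - metrics["testing accuracy"]
)
print(metrics.round(3))
```
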
  • The baseline model (2 layers without dropout) has the highest training accuracy ($0.63$), testing accuracy ($0.55$), and precision ($0.54$), but suffers from overfitting (train/test gap: $0.08$) and low recall ($0.39$) on class 1 (uptrend).

  • Variant model A (2 layers with dropout) shows better class 1 recall ($0.67$) and the highest class 1 F1 score ($0.57$), indicating stronger detection of uptrends despite modest accuracy and AUC. Its slight drop in accuracy is an acceptable trade-off for its predictive strength on the target class. It is the only model with recall higher than precision, demonstrating the most aggressive prediction style of the four models.

  • Variant model B (3 layers without dropout) offers no significant accuracy or AUC improvement and has the lowest class 1 recall ($0.30$) and F1 score ($0.37$), suggesting it is less effective at identifying uptrends. Its train/test accuracy gap and AUC match the baseline model's, and its precision and recall follow a similar pattern; it can be seen as a similar but weaker version of the baseline model.

  • Variant model C (3 layers with dropout) achieves the lowest overfitting ($0.00$ gap) and highest AUC ($0.55$), indicating balanced generalization from dropout and better discriminative power, possibly contributed by the additional layer. But its recall ($0.32$) and F1 ($0.40$) remain low, indicating weak capability to catch upward movements; its complexity did not translate into substantially better classification metrics.

  • Dropout generally improves generalization: Comparing both pairs (2-layer vs. 2-layer with dropout and 3-layer vs. 3-layer with dropout), dropout reduced overfitting and improved recall for detecting uptrends.

  • Adding a third LSTM layer doesn't yield clear benefits: Increasing depth from 2 to 3 layers did not improve accuracy, AUC, or class 1 performance. In fact, the 3-layer models (with or without dropout) showed lower F1 scores ($0.37–0.40$) and poorer recall than the 2-layer with dropout model.

Recommended Model for Backtesting: Baseline model (2 layers without dropout)¶

Adding an extra layer negatively impacted model performance, so neither Variant Model B nor C will be considered the best candidate.

Then, I conducted backtesting for both the baseline model and variant model A, and the baseline model showed slightly better performance.

This is because Variant A achieved a relatively high F1 score ($0.57$), driven by strong recall ($0.67$) but limited precision ($0.49$), reflecting an aggressive prediction style that led to more false alarms. Given the high volatility and frequent downturns in Adobe’s share price between 2020 and 2025, these false alarms would introduce significant negative returns that offset the gains from correct predictions.

As a result, a more conservative strategy is better suited for backtesting such a volatile asset. The baseline model, with its higher precision and accuracy, is the preferred choice for final deployment.

Back to Content

5. Trading strategy with backtesting¶

In this chapter, we encapsulate the baseline model (2-layer without dropout) as a trading strategy to assess its performance.

Specifically, we reapply the model to the full Adobe dataset from 2020 to 2025 and evaluate its results using a range of analytical techniques.

5.1 Profit analysis¶

To assess the effectiveness of the baseline model, we will conduct a simple backtesting trading exercise with the 5-year Adobe price data. The trading rules are straightforward:

  • Buy Signal (Predicted = 1): Purchase one share of Adobe at the day's closing price.

  • Sell on the Next Day: If a purchase occurs, the position will be sold at the next day's closing price.

  • No Trade (Predicted = 0): No transaction is executed for this day's prediction.

  • No Transaction Costs or Friction Included: This backtest assumes a cost-free trading environment for simplicity.

Trades following these rules will be referred to as the LSTM Strategy in the subsequent analysis.

After a series of calculations, I derived the daily cumulative profit from trading using the LSTM strategy in the past 5 years. For comparison, I also computed the daily cumulative profit following a buy-and-hold approach.
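
The daily P&L of these rules can be computed vectorially. The sketch below uses a short synthetic price series and signal vector; the notebook applies the same logic to the real ADBE closes and model predictions.

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 104.0, 103.0])  # synthetic closing prices
signal = pd.Series([1, 0, 1, 1, 0])                      # model predictions (1 = buy)

# Buy at today's close when signal == 1, sell at tomorrow's close.
next_day_move = close.shift(-1) - close
strategy_pnl = (signal * next_day_move).fillna(0.0)      # last day has no next close
cum_strategy = strategy_pnl.cumsum()                     # cumulative strategy profit

# Buy-and-hold benchmark: one share bought at the first close.
cum_hold = close - close.iloc[0]

print(cum_strategy.iloc[-1], cum_hold.iloc[-1])  # -> 4.0 3.0
```

On this toy series the strategy earns $+2$, skips the $-1$ day, earns $+3$, and loses $-1$, for a cumulative $+4$ per share versus $+3$ for buy-and-hold.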

The illustration below visualizes the performance of both strategies over time.

26_profit_analysis

Observations from the profit analysis:

  • The LSTM strategy generated $\$1153.52$ per share, while buy-and-hold yielded only $\$36.71$ — roughly a 30x profit improvement.

  • Adobe stock was highly volatile with limited overall growth during the 5 years, leading to frequent losses for buy-and-hold investors.

  • LSTM strategy delivered smoother, more consistent gains with much lower exposure to volatility.

  • Major market drawdowns (e.g., 2022, early 2024, early 2025) had limited impact on LSTM, showing the model's strong downside protection.

Overall, the LSTM strategy not only vastly outperformed the buy-and-hold approach in absolute profit, but also provided a more stable and resilient path through a volatile market environment. Its ability to sidestep downturns while maintaining consistent growth highlights its strength as an active trading strategy.

Back to Content

5.2 Rolling Sharpe ratio analysis¶

The Sharpe ratio measures return relative to risk:

$$\text{Sharpe Ratio}=\frac{\text{Mean Return}}{\text{Standard Deviation of Return}}$$ A rolling Sharpe ratio calculates this over a moving time window (e.g. 126 trading days ≈ 6 months), so we can see how a strategy’s risk-adjusted performance evolves. It highlights the model’s consistency, stability, and adaptability across different market conditions.
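
The rolling computation is straightforward with pandas; a sketch on synthetic daily returns (the report's real series comes from the backtest). Annualizing by $\sqrt{252}$ is a common additional step not included in the formula above, so it is omitted here too.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
daily_returns = pd.Series(rng.normal(0.001, 0.02, size=300))  # synthetic daily returns

WINDOW = 126  # ~6 months of trading days
rolling_sharpe = (
    daily_returns.rolling(WINDOW).mean() / daily_returns.rolling(WINDOW).std()
)
print(rolling_sharpe.dropna().head())
```

The first `WINDOW - 1` entries are NaN because the window is not yet full; plotting the two strategies' series on one axis reproduces the comparison chart.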

In this analysis, we compute the 126-day rolling Sharpe ratio for both the LSTM-based strategy and the buy-and-hold benchmark over the past 5 years, and compare their trajectories as shown in the below illustration to assess relative performance.

27_rolling_sharpe

Observations from the rolling Sharpe ratio comparison:

  • LSTM strategy maintains consistently higher Sharpe ratios than buy-and-hold across all periods.

  • Both strategies follow similar trends, indicating shared market exposure, but LSTM exhibits stronger resilience.

  • LSTM stays largely above zero, reflecting sustained positive risk-adjusted returns.

  • During major drawdowns (e.g., late 2021–mid 2022 and late 2023–mid 2024), the LSTM strategy shows less severe declines and quicker recoveries, indicating better downside control.

Overall, the visualization highlights the LSTM strategy's strong ability to navigate volatile market conditions while preserving return consistency. Its elevated and stable Sharpe ratios show effective signal learning and risk management.

Back to Content

5.3 Underwater curve analysis¶

An underwater curve is a visual representation of drawdowns over time — it shows how far an investment is below its previous peak.

$$\text{Underwater} = \frac{\text{Current Cumulative Return}}{\text{Historical Peak}} - 1$$ The values are always smaller than or equal to zero (zero means it's at a peak), and more negative means deeper drawdown.

We again use the daily returns of the LSTM strategy and the buy-and-hold approach to compute their underwater values over the past 5 years and visualize them in the illustration below.

28_underwater

Observations from the underwater curve comparison:

  • LSTM strategy exhibits frequent recovery to new highs, keeping its drawdowns shallow and short-lived. Most underwater periods remain above $-20\%$, and recovery often occurs within months.

  • In contrast, the buy-and-hold approach suffers from long and deep drawdowns, including multi-year recovery periods and max drawdowns reaching $-60\%$.

  • The volatility of drawdowns in LSTM is much lower than that of buy-and-hold, indicating better risk control from the baseline model.

  • In recent years, especially from 2022 to 2025, LSTM’s ability to recover quickly after market downturns contrasts sharply with buy-and-hold’s persistent underwater state.

The underwater plot clearly shows that the LSTM strategy provides strong downside protection and quicker recovery compared to the buy-and-hold approach. While both strategies experience drawdowns during turbulent periods, LSTM’s drawdowns are more contained and typically followed by swift rebounds, indicating stronger resilience.

Back to Content

5.4 Pyfolio analysis¶

Pyfolio is a Python library for analyzing portfolio performance and risk management, making it especially valuable for evaluating backtested trading strategies. It provides a comprehensive set of metrics to assess a strategy’s effectiveness.

Below are the metrics provided by Pyfolio to evaluate the backtesting of the LSTM strategy and buy-and-hold approach.
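
Pyfolio computes these figures internally, but the headline numbers are easy to sanity-check by hand. The sketch below reproduces three of them on a synthetic return series standing in for the strategy's real one (risk-free rate assumed zero, 252 trading days per year).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
returns = pd.Series(rng.normal(0.001, 0.015, size=252))  # synthetic daily returns

# Annualized return from the compounded growth over the sample.
ann_return = (1 + returns).prod() ** (252 / len(returns)) - 1
# Annualized volatility and Sharpe ratio (risk-free rate = 0).
ann_vol = returns.std() * np.sqrt(252)
sharpe = returns.mean() / returns.std() * np.sqrt(252)
# Max drawdown: deepest point of the underwater curve.
cumulative = (1 + returns).cumprod()
max_drawdown = (cumulative / cumulative.cummax() - 1).min()

print(round(ann_return, 3), round(ann_vol, 3), round(sharpe, 2), round(max_drawdown, 3))
```
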

| Metrics             | LSTM strategy | Buy-and-hold approach |
|---------------------|---------------|-----------------------|
| Start Date          | 2020-05-14    | 2020-05-14            |
| End Date            | 2025-07-01    | 2025-07-01            |
| Total Months        | 61            | 61                    |
| Annual Return       | 62.4%         | 1.8%                  |
| Cumulative Returns  | 1094.1%       | 9.4%                  |
| Annual Volatility   | 26.1%         | 36.0%                 |
| Sharpe Ratio        | 1.99          | 0.23                  |
| Calmar Ratio        | 2.77          | 0.03                  |
| Stability           | 0.97          | 0.00                  |
| Max Drawdown        | -22.5%        | -60.0%                |
| Omega Ratio         | 1.63          | 1.04                  |
| Sortino Ratio       | 3.04          | 0.31                  |
| Skew                | -0.36         | -0.83                 |
| Kurtosis            | 16.43         | 7.77                  |
| Tail Ratio          | 1.38          | 0.95                  |
| Daily Value at Risk | -3.1%         | -4.5%                 |

Observations from the metrics by Pyfolio:

  • LSTM strategy achieved $62.4\%$ annual return, much higher than $1.8\%$ from the buy-and-hold. The gap of cumulative returns ($1094.1\%$ vs. $9.4\%$) is also quite significant.

  • Sharpe ratio ($1.99$) and Sortino ratio ($3.04$) indicate strong risk-adjusted returns for LSTM; buy-and-hold lags far behind with Sharpe of $0.23$ and Sortino of $0.31$.

  • LSTM volatility is lower ($26.1\%$) compared to buy-and-hold ($36.0\%$), showing more stable performance despite higher returns.

  • Max drawdown is much smaller for LSTM ($-22.5\%$) than buy-and-hold ($-60\%$), signaling better downside protection. This was also discussed in the underwater curve analysis.

The LSTM strategy demonstrates not just impressive returns, but a good balance of profitability and risk management. It offers a significantly more attractive and robust approach for navigating volatile markets, making it a strong candidate for active trading.

Back to Content

6. Conclusion¶

In this project, I explored the application of LSTM-based deep learning models in predicting upward movements of Adobe stock, and designed a strategy that outperforms traditional approaches in both returns and risk control.

Through a comprehensive process — from sourcing and cleaning data, crafting features using diverse techniques, and performing exploratory analysis with multicollinearity reduction and scaling — to building and tuning baseline and variant models, the workflow reflects a rigorous approach of quantitative modeling.

Beyond the technical outcomes, this project was a valuable learning experience for me as well. It deepened my understanding of deep learning in finance, enhanced my modeling skills, and strengthened my command of Python programming and modern ML libraries. I’m grateful to the CQF program for equipping me with the skills to complete this project, and to my family for their unwavering support during this intensive learning phase.

Back to Content